In [42]:
%matplotlib inline
In [2]:
import nltk
Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the natural language world - unstructured data that by its very nature has latent information that is important to humans. NLP practitioners have benefited from machine learning techniques to unlock meaning from large corpora, and in this tutorial we'll explore how to do that using Python, the Natural Language Toolkit (NLTK) and Gensim.
NLTK is an excellent library for machine-learning based NLP, written in Python by experts from both academia and industry. Python allows you to create rich data applications rapidly, iterating on hypotheses. The combination of Python + NLTK means that you can easily add language-aware data products to your larger analytical workflows and applications.
NLTK was written by two eminent computational linguists, Steven Bird (Senior Research Associate of the LDC and professor at the University of Melbourne) and Ewan Klein (Professor of Linguistics at Edinburgh University). The NLTK library provides a combination of natural language corpora, lexical resources, and example grammars with language processing algorithms, methodologies and demonstrations for a very Pythonic "batteries included" view of natural language processing.
As such, NLTK is perfect for research-driven (hypothesis-driven) workflows for agile data science.
This notebook has a few dependencies, most of which can be installed via the Python package manager, pip.
Once you have Python and pip installed you can install NLTK from the terminal as follows:
~$ pip install nltk
~$ pip install matplotlib
~$ pip install beautifulsoup4
~$ pip install gensim
Note that these will also install NumPy and SciPy if they aren't already installed.
NLTK is a useful pedagogical resource for learning NLP with Python and serves as a starting place for producing production grade code that requires natural language analysis. It is also important to understand what NLTK is not.
NLTK provides a variety of tools that can be used to explore the linguistic domain, but it is not a lightweight dependency that can be easily included in other workflows, especially those that require unit and integration testing or other build processes. This stems from the fact that NLTK includes not only a large amount of code but also a rich and complete library of corpora that powers the built-in algorithms. The heavyweight pieces include:
- Syntactic parsing
- The sem package
- Lots of extra stuff (a heavyweight dependency)
Knowing the good and the bad parts will help you explore NLTK further - looking into the source code to extract the material you need, then moving that code to production. We will explore NLTK in more detail in the rest of this notebook.
NLTK ships with a variety of corpora; let's use a few of them to do some work. To download the NLTK corpora, open a Python interpreter:
import nltk
nltk.download()
This will open up a window with which you can download the various corpora and models to a specified location. For now, go ahead and download it all, as we will be exploring as much of NLTK as we can. Also take note of the download directory - you're going to want to know where that is so you can get a detailed look at the corpora that are included. I usually export an environment variable to track this. You can do this from your terminal:
~$ export NLTK_DATA=/path/to/nltk_data
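From Python you can double-check where NLTK will look for this data; nltk.data.path is the list of directories NLTK searches, and it picks up NLTK_DATA when that variable is set:
import nltk
print(nltk.data.path)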
In [3]:
# Take a moment to explore what is in this directory
dir(nltk)
Out[3]:
In [5]:
# Lists the various corpora and CorpusReader classes in the nltk.corpus module
for name in dir(nltk.corpus):
    if name.islower() and not name.startswith('_'):
        print(name)
In [7]:
# You can explore the titles with:
print(nltk.corpus.gutenberg.fileids())
In [8]:
# For a specific corpus, list the fileids that are available:
print(nltk.corpus.shakespeare.fileids())
text.Text()
The nltk.text.Text class is a wrapper around a sequence of simple (string) tokens, intended only for the initial exploration of text, usually via the Python REPL. It provides exploratory methods such as concordance, similar, common_contexts, and dispersion_plot, which we demonstrate below.
You shouldn't use this class in production-level systems, but it is useful for exploring (small) snippets of text in a meaningful fashion.
For example, you can get access to the text from Hamlet as follows:
In [9]:
hamlet = nltk.text.Text(nltk.corpus.gutenberg.words('shakespeare-hamlet.txt'))
In [10]:
hamlet.concordance("king", 55, lines=10)
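Beyond concordance, a couple of the other exploratory methods on Text are worth trying while hamlet is loaded; a quick sketch (these calls are not in the original cells, but are standard Text methods):
# Frequent bigram collocations in the text
hamlet.collocations()
# How many times a given token occurs
print(hamlet.count("King"))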
similar()
Given some context surrounding a word, we can discover similar words, e.g. words that occur frequently in the same context and with a similar distribution (distributional similarity).
Note that ContextIndex.similar_words(word) calculates a similarity score for each word as the sum of the products of its frequencies in each context, while Text.similar() simply counts the number of unique contexts the words share.
In [11]:
print(hamlet.similar("marriage"))
austen = nltk.text.Text(nltk.corpus.gutenberg.words("austen-sense.txt"))
print()
print(austen.similar("marriage"))
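If you want the score-based behavior described above rather than the shared-context count, you can build a ContextIndex over the same tokens yourself; a minimal sketch:
# Build a context index directly from the Hamlet tokens
idx = nltk.text.ContextIndex(nltk.corpus.gutenberg.words('shakespeare-hamlet.txt'))
print(idx.similar_words("marriage"))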
In [15]:
hamlet.common_contexts(["king", "father"])
Your turn! Go ahead and explore similar words and contexts - what does the common context mean?
dispersion_plot()
NLTK also uses matplotlib and pylab to display graphs and charts that can show dispersions and frequency. This is especially interesting for the corpus of inaugural addresses given by U.S. presidents.
In [16]:
inaugural = nltk.text.Text(nltk.corpus.inaugural.words())
inaugural.dispersion_plot(["citizens", "democracy", "freedom", "duty", "America"])
In [18]:
print(nltk.corpus.stopwords.fileids())
print(nltk.corpus.stopwords.words('english'))
import string
print(string.punctuation)
In [19]:
corpus = nltk.corpus.brown
print(corpus.paras())
In [20]:
print(corpus.sents())
In [15]:
print(corpus.words())
In [16]:
print(corpus.raw()[:200]) # Be careful!
Your turn! Explore some of the text in the available corpora
In statistical machine learning approaches to NLP, the very first thing we need to do is count things - especially the unigrams that appear in the text and their relationships to each other. NLTK provides two excellent classes to enable these frequency analyses:
- FreqDist
- ConditionalFreqDist
These two classes serve as the foundation for most of the probability and statistical analyses that we will conduct.
Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.: the rank-frequency distribution is an inverse relation. Read more on Wikipedia.
First we will compute the following:
In [17]:
reuters = nltk.corpus.reuters # Corpus of news articles
counts = nltk.FreqDist(reuters.words())
vocab = len(counts.keys())
words = sum(counts.values())
lexdiv = float(words) / float(vocab)
print("Corpus has %i types and %i tokens for a lexical diversity of %0.3f" % (vocab, words, lexdiv))
In [18]:
counts.B()  # the number of bins (distinct word types) in the distribution
Out[18]:
In [19]:
print(counts.most_common(40))
In [20]:
print(counts.max())
In [21]:
# Hapaxes are words that occur only once in the corpus
print(counts.hapaxes()[0:10])
In [22]:
counts.freq('stipulate') * 100  # relative frequency of 'stipulate', as a percentage
Out[22]:
In [23]:
counts.plot(50, cumulative=False)
In [24]:
# By setting cumulative to True, we can visualize the cumulative counts of the _n_ most common words.
counts.plot(50, cumulative=True)
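To see Zipf's law more directly, you can plot frequency against rank on log-log axes, where an approximate power law shows up as a roughly straight line. A minimal sketch using matplotlib and the counts FreqDist from above:
import matplotlib.pyplot as plt
# Frequencies sorted from most to least common; rank 1 is the most frequent word
freqs = sorted(counts.values(), reverse=True)
ranks = range(1, len(freqs) + 1)
plt.loglog(ranks, freqs)
plt.xlabel("rank")
plt.ylabel("frequency")
plt.show()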
In [24]:
from itertools import chain
brown = nltk.corpus.brown
categories = brown.categories()
counts = nltk.ConditionalFreqDist(chain(*[[(cat, word) for word in brown.words(categories=cat)] for cat in categories]))
for category, dist in counts.items():
    vocab = len(dist.keys())
    tokens = sum(dist.values())
    lexdiv = float(tokens) / float(vocab)
    print("%s: %i types with %i tokens and lexical diversity of %0.3f" % (category, vocab, tokens, lexdiv))
In [25]:
for ngram in nltk.ngrams(["The", "bear", "walked", "in", "the", "woods", "at", "midnight"], 5):
    print(ngram)
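n-grams combine naturally with FreqDist when you want to count relationships between adjacent words; for example, a sketch of the most common bigrams in the Reuters corpus loaded earlier (this can take a little while):
bigrams = nltk.FreqDist(nltk.ngrams(reuters.words(), 2))
print(bigrams.most_common(10))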
NLTK is great at the preprocessing of raw text - it provides the following tools for dividing text into its constituent parts:
sent_tokenize: a Punkt sentence tokenizer
This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.
However, Punkt is designed to learn parameters (a list of abbreviations, etc.) unsupervised from a corpus similar to the target domain. The pre-packaged models may therefore be unsuitable: use PunktSentenceTokenizer(text) to learn parameters from the given text, as sketched below.
word_tokenize: a Treebank tokenizer
The Treebank tokenizer uses regular expressions to tokenize text as in the Penn Treebank. This is the method that is invoked by word_tokenize(). It assumes that the text has already been segmented into sentences, e.g. using sent_tokenize().
pos_tag: a maximum entropy tagger trained on the Penn Treebank
There are several other taggers, including (notably) the BrillTagger, as well as the BrillTrainer to train your own tagger or tagset.
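As a minimal sketch of the Punkt retraining mentioned above (using a Gutenberg novel as a stand-in for plaintext from your own domain):
from nltk.tokenize.punkt import PunktSentenceTokenizer
# Train an unsupervised sentence tokenizer on raw domain text
raw_text = nltk.corpus.gutenberg.raw('austen-sense.txt')
domain_tokenizer = PunktSentenceTokenizer(raw_text)
print(domain_tokenizer.tokenize("Mr. Dashwood arrived at Norland. He was not expected."))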
In [26]:
import bs4
from readability.readability import Document
# Tags to extract as paragraphs from the HTML text
TAGS = [
'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7', 'p', 'li'
]
def read_html(path):
    with open(path, 'r') as f:
        # Transform the document into a readability paper summary
        html = Document(f.read()).summary()
        # Parse the HTML using BeautifulSoup
        soup = bs4.BeautifulSoup(html)
        # Extract the paragraph delimiting elements
        for tag in soup.find_all(TAGS):
            # Get the HTML node text
            yield tag.get_text()
In [27]:
for paragraph in read_html('fixtures/nrRB0.html'):
    print(paragraph + "\n")
In [28]:
text = u"Medical personnel returning to New York and New Jersey from the Ebola-riddled countries in West Africa will be automatically quarantined if they had direct contact with an infected person, officials announced Friday. New York Gov. Andrew Cuomo (D) and New Jersey Gov. Chris Christie (R) announced the decision at a joint news conference Friday at 7 World Trade Center. “We have to do more,” Cuomo said. “It’s too serious of a situation to leave it to the honor system of compliance.” They said that public-health officials at John F. Kennedy and Newark Liberty international airports, where enhanced screening for Ebola is taking place, would make the determination on who would be quarantined. Anyone who had direct contact with an Ebola patient in Liberia, Sierra Leone or Guinea will be quarantined. In addition, anyone who traveled there but had no such contact would be actively monitored and possibly quarantined, authorities said. This news came a day after a doctor who had treated Ebola patients in Guinea was diagnosed in Manhattan, becoming the fourth person diagnosed with the virus in the United States and the first outside of Dallas. And the decision came not long after a health-care worker who had treated Ebola patients arrived at Newark, one of five airports where people traveling from West Africa to the United States are encountering the stricter screening rules."
for sent in nltk.sent_tokenize(text):
    print(sent)
    print()
In [29]:
for sent in nltk.sent_tokenize(text):
    print(list(nltk.wordpunct_tokenize(sent)))
    print()
In [30]:
for sent in nltk.sent_tokenize(text):
    print(list(nltk.pos_tag(nltk.word_tokenize(sent))))
    print()
All of these taggers work pretty well - but you can (and should) train them on your own corpora.
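Training a full tagger is out of scope for this notebook, but as a rough sketch of what it involves, NLTK's backoff n-gram taggers (a simpler route than the BrillTrainer mentioned earlier) can be trained on the tagged Brown corpus in a few lines:
# Train a unigram/bigram backoff tagger on tagged Brown sentences and measure its accuracy
tagged_sents = nltk.corpus.brown.tagged_sents(categories='news')
train, test = tagged_sents[:4000], tagged_sents[4000:]
default = nltk.DefaultTagger('NN')
unigram = nltk.UnigramTagger(train, backoff=default)
bigram = nltk.BigramTagger(train, backoff=unigram)
print(bigram.evaluate(test))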
We have an immense number of word forms, as you can see from our various counts in the FreqDist above - it is helpful for many applications to normalize these word forms (especially for applications like search) into some canonical word for further exploration. In English (and many other languages), morphological context indicates gender, tense, quantity, etc., but these subtleties might not be necessary:
Stemming = chop off affixes to get the root stem of the word:
- running --> run
- flowers --> flower
- geese --> geese
Lemmatization = look up the word form in a lexicon to get the canonical lemma:
- women --> woman
- foxes --> fox
- sheep --> sheep
There are several stemmers available:
- Lancaster (English, newer and aggressive)
- Porter (English, original stemmer)
- Snowball (Many languages, newest)
In [24]:
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.porter import PorterStemmer
text = list(nltk.word_tokenize("The women running in the fog passed bunnies working as computer scientists."))
snowball = SnowballStemmer('english')
lancaster = LancasterStemmer()
porter = PorterStemmer()
for stemmer in (snowball, lancaster, porter):
    stemmed_text = [stemmer.stem(t) for t in text]
    print(" ".join(stemmed_text))
In [59]:
from nltk.stem.wordnet import WordNetLemmatizer
# Note: use part of speech tag, we'll see this in machine learning!
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in text]
print(" ".join(lemmas))
Note that the lemmatizer has to load the WordNet corpus which takes a bit.
Typical normalization of text for use as features in machine learning models looks something like this:
In [28]:
import string
from nltk.corpus import wordnet as wn
## Module constants
lemmatizer = WordNetLemmatizer()
stopwords = set(nltk.corpus.stopwords.words('english'))
punctuation = string.punctuation
def tagwn(tag):
    """
    Returns the WordNet tag from the Penn Treebank tag.
    """
    return {
        'N': wn.NOUN,
        'V': wn.VERB,
        'R': wn.ADV,
        'J': wn.ADJ
    }.get(tag[0], wn.NOUN)
def normalize(text):
    for token, tag in nltk.pos_tag(nltk.wordpunct_tokenize(text)):
        # If you're going to do part of speech tagging, do it here
        token = token.lower()
        if token in stopwords or token in punctuation:
            continue
        token = lemmatizer.lemmatize(token, tagwn(tag))
        yield token
print(list(normalize("The eagle flies at midnight.")))
In [36]:
print(nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize("John Smith is from the United States of America and works at Microsoft Research Labs"))))
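If you want the recognized entities as flat strings rather than a chunk tree, you can walk the subtrees of the result; a small sketch:
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(
    "John Smith is from the United States of America and works at Microsoft Research Labs")))
for entity in tree.subtrees():
    # Named entity chunks carry labels such as PERSON, ORGANIZATION, and GPE
    if entity.label() in ('PERSON', 'ORGANIZATION', 'GPE'):
        print(entity.label(), " ".join(word for word, tag in entity.leaves()))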
You can also wrap the Stanford NER system, which many of you are probably already used to using.
In [21]:
import os
from nltk.tag import StanfordNERTagger
# change the paths below to point to wherever you unzipped the Stanford NER download file
stanford_root = '/Users/benjamin/Development/stanford-ner-2014-01-04'
stanford_data = os.path.join(stanford_root, 'classifiers/english.all.3class.distsim.crf.ser.gz')
stanford_jar = os.path.join(stanford_root, 'stanford-ner-2014-01-04.jar')
st = StanfordNERTagger(stanford_data, stanford_jar, 'utf-8')
for i in st.tag("John Smith is from the United States of America and works at Microsoft Research Labs".split()):
    print('[' + i[1] + '] ' + i[0])
In [31]:
for name in dir(nltk.parse):
    if not name.startswith('_'): print(name)
Similar to how you might write a compiler or an interpreter, parsing starts with a grammar that defines the construction of phrases and terminal entities.
In [51]:
grammar = nltk.grammar.CFG.fromstring("""
S -> NP PUNCT | NP
NP -> N N | ADJP NP | DET N | DET ADJP
ADJP -> ADJ NP | ADJ N
DET -> 'an' | 'the' | 'a' | 'that'
N -> 'airplane' | 'runway' | 'lawn' | 'chair' | 'person'
ADJ -> 'red' | 'slow' | 'tired' | 'long'
PUNCT -> '.'
""")
In [60]:
def parse(sent):
    sent = sent.lower()
    parser = nltk.parse.ChartParser(grammar)
    for p in parser.parse(nltk.word_tokenize(sent)):
        yield p

for tree in parse("the long runway"):
    tree.pprint()
    tree[0].draw()
NLTK does come with some large grammars, but if constructing your own domain-specific grammar isn't your thing, you can use the Stanford parser (so long as you're willing to comply with its GPL license, or pay for a commercial one).
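For instance, one of the larger grammars in the NLTK data download can be loaded by path; a sketch, assuming you downloaded the grammar collections earlier and that the ATIS grammar lives at the path below:
# Load the ATIS grammar that ships with the NLTK data packages
atis_grammar = nltk.data.load('grammars/large_grammars/atis.cfg')
print(atis_grammar)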
In [61]:
from nltk.parse.stanford import StanfordParser
# change the paths below to point to wherever you unzipped the Stanford parser download file
stanford_root = '/Users/benjamin/Development/stanford-parser-full-2014-10-31'
stanford_model = os.path.join(stanford_root, 'stanford-parser-3.5.0-models.jar')
stanford_jar = os.path.join(stanford_root, 'stanford-parser.jar')
st = StanfordParser(path_to_jar=stanford_jar, path_to_models_jar=stanford_model)
sent = "The man hit the building with the baseball bat."
for tree in st.parse(nltk.wordpunct_tokenize(sent)):
    tree.pprint()
    tree.draw()